This vignette shows how to use the basic functionality of the gplite package. We’ll start with a simple regression example, and then illustrate the usage with non-Gaussian likelihoods. This vignette assumes that you’re already familiar with Gaussian processes (GPs), so we shall not focus on the mathematical details. To read more about GPs, see Rasmussen and Williams (2006).

1 Regression

library(gplite)
library(ggplot2)
theme_set(theme_classic())

Let us simulate some toy regression data.

# create some toy 1d regression data
set.seed(32004)
n <- 200
sigma <- 0.1
x <- rnorm(n)
y <- sin(3*x)*exp(-abs(x)) +  rnorm(n)*sigma 
ggplot() + 
  geom_point(aes(x=x,y=y), size=0.5) + 
  xlab('x') + ylab('y') + xlim(-4,4)

Let’s first create the model. We’ll use the squared exponential covariance function and a Gaussian likelihood for this example. With a Gaussian observation model, we can compute the posterior analytically for a given set of hyperparameters.

gp <- gp_init(cfs = cf_sexp(), lik = lik_gaussian())

There are three hyperparameters in the model. We can see their initial values using get_param. Notice that this function returns the parameters on the log scale, so we exponentiate them to get the actual values:

exp(get_param(gp))
##     cf_sexp.lscale       cf_sexp.magn lik_gaussian.sigma 
##                0.3                1.0                0.5
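The log-scale parametrization keeps the hyperparameters positive during optimization. If different starting values are desired, they can in principle be set before fitting; the sketch below uses set_param, assuming it accepts the same log-scale vector that get_param returns (check ?set_param before relying on this):

```r
# sketch: change the initial hyperparameters before optimization
# (set_param and its log-scale convention are assumptions; see ?set_param)
param <- get_param(gp)            # log-scale parameter vector
param[] <- log(c(0.5, 1.0, 0.3))  # lscale, magn, sigma
gp0 <- set_param(gp, param)
exp(get_param(gp0))
```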

We can optimize the hyperparameters to the maximum marginal likelihood solution using gp_optim.

gp <- gp_optim(gp, x, y)

Let’s print out the hyperparameters after optimization:

exp(get_param(gp))
##     cf_sexp.lscale       cf_sexp.magn lik_gaussian.sigma 
##         0.42449158         0.33185388         0.09841499

We can easily make predictions with the fitted model using the function gp_pred. Let’s predict and visualize the results:

# compute the predictive mean and variance in a grid of points
xt <- seq(-4,4,len=300)
pred <- gp_pred(gp, xt, var=TRUE)

# visualize
mu <- pred$mean
lb <- pred$mean - 2*sqrt(pred$var)
ub <- pred$mean + 2*sqrt(pred$var)
ggplot() + 
  geom_ribbon(aes(x=xt, ymin=lb, ymax=ub), fill='lightgray') +
  geom_line(aes(x=xt, y=mu), size=1) +
  geom_point(aes(x=x, y=y), size=0.5) +
  xlab('x') + ylab('y')
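Since the data were simulated, the true noiseless function is known, and we can quantify the accuracy of the predictive mean on the grid:

```r
# root mean squared error of the predictive mean against the true function
f_true <- sin(3*xt)*exp(-abs(xt))
sqrt(mean((pred$mean - f_true)^2))
```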

2 Binary classification

# create 1d toy binary classification data
set.seed(32004)
n <- 150
sigma <- 0.1
x <- rnorm(n)
ycont <- sin(3*x)*exp(-abs(x)) +  rnorm(n)*sigma
y <- rep(0,n)
y[ycont > 0] <- 1

ggplot() + 
  geom_point(aes(x=x,y=y), size=0.5) + 
  xlab('x') + ylab('y')

Set up the model and optimize the hyperparameters. Notice that due to the non-Gaussian likelihood the posterior is not analytically available, so the inference will use the Laplace approximation.

gp <- gp_init(cfs = cf_sexp(), lik = lik_bernoulli())
gp <- gp_optim(gp, x, y)

Let us visualize the predictive fit. For a non-Gaussian likelihood, this is most conveniently done by drawing from the predictive distribution using gp_draw, and then computing the intervals from the obtained draws.

# predict in a grid of points
xt <- seq(-4,4,len=200)
pred <- gp_draw(gp, xt, draws=1000)
pred_mean <- rowMeans(pred)
qt <- apply(pred, 1, quantile, c(0.05, 0.95))
pred_lb <- qt[1,]
pred_ub <- qt[2,]

ggplot() + 
  geom_ribbon(aes(x=xt,ymin=pred_lb,ymax=pred_ub), fill='lightgray') +
  geom_line(aes(x=xt,y=pred_mean), color='black', size=1) +
  geom_point(aes(x=x,y=y), color='black', size=0.5) +
  xlab('x') + ylab('y')
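The same draws-based approach also yields hard classifications. Here is a simple sketch that thresholds the predictive probabilities at the training inputs at 0.5 and computes the training accuracy:

```r
# predictive probabilities at the training inputs and the implied
# hard classifications with a 0.5 threshold
prob_train <- rowMeans(gp_draw(gp, x, draws=1000))
yhat <- as.numeric(prob_train > 0.5)
mean(yhat == y)  # training accuracy
```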

The fit looks reasonable, but one could expect the predictive probabilities to be even closer to zero and one around the origin, given the data we see. Indeed, it is well known that the Laplace approximation tends to be too conservative and underestimate the extreme probabilities in cases like this (see, for example, Kuss and Rasmussen 2005).

For more accurate inference, we can run MCMC for the latent values with the optimized hyperparameters. To use more than one core for the computation, uncomment the first line.

#options(mc.cores = parallel::detectCores())
gpmc <- gp_mcmc(gp, x, y, chains=2, iter=1000)

Let’s visualize the fit again:

# predict in a grid of points
xt <- seq(-4,4,len=200)
pred <- gp_draw(gpmc, xt)
pred_mean <- rowMeans(pred)
qt <- apply(pred, 1, quantile, c(0.05, 0.95))
pred_lb <- qt[1,]
pred_ub <- qt[2,]

ggplot() + 
  geom_ribbon(aes(x=xt,ymin=pred_lb,ymax=pred_ub), fill='lightgray') +
  geom_line(aes(x=xt,y=pred_mean), color='black', size=1) +
  geom_point(aes(x=x,y=y), color='black', size=0.5) +
  xlab('x') + ylab('y')

As we see, the fit is substantially different in those cases where the predictive probability is close to zero or one. It is worth noting that even though the Laplace approximation is not very accurate in these extreme cases, it gives reasonable estimates for the hyperparameters. Running MCMC for the latent values (given the hyperparameters) then gives more accurate estimates for the latent function values.
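The difference between the two fits can also be quantified directly by comparing the predictive means of the Laplace fit (gp) and the MCMC fit (gpmc) on the same grid:

```r
# largest discrepancy between the Laplace and MCMC predictive means
pred_laplace <- rowMeans(gp_draw(gp, xt, draws=1000))
pred_mcmc <- rowMeans(gp_draw(gpmc, xt))
max(abs(pred_laplace - pred_mcmc))
```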

3 Sparse approximations

The package provides a few methods for handling bigger datasets for which full GP inference is infeasible. This section illustrates their use with a synthetic toy example.

# 2d data, version 2
set.seed(2)
n_per_cluster <- 2000
d <- 2
drel <- 2
x <- 0.5*cbind( matrix(rnorm(n_per_cluster*drel), nrow=drel) + 3*rep(1,drel), 
                matrix(rnorm(n_per_cluster*drel), nrow=drel) - 0*rep(1,drel),
                matrix(rnorm(n_per_cluster*drel), nrow=drel) + c(-3,3))
x <- t(x)
y <- c(rep(0,n_per_cluster), rep(1,n_per_cluster), rep(0,n_per_cluster))
n <- length(y) 

# plot 
ggplot() +
  geom_point(data=data.frame(x=x[,1],y=x[,2]), aes(x=x,y=y), color=y+2, size=1)

3.1 FITC

The package implements a generic sparse approximation known as the fully independent training conditional, or FITC (Quiñonero-Candela and Rasmussen 2005; Snelson and Ghahramani 2006). The accuracy of the approximation is determined by the number of inducing points. Notice that we use only a relatively small number of inducing points here to make this vignette build fast. In this toy example this does not affect the accuracy much, but in more complicated examples many more inducing points might be needed to obtain reasonable accuracy.

# fit the model
gp <- gp_init(lik=lik_bernoulli(), cfs=cf_sexp(), method = method_fitc(num_inducing=50))
gp <- gp_optim(gp,x,y)

Let’s visualize the predictions.

# predict
ng <- 20
x1g <- seq(-4,4,len=ng)
x2g <- seq(-2,4,len=ng)

xnew <- cbind( rep(x1g,each=ng), rep(x2g,ng) )
pred <- gp_draw(gp, xnew, draws=500)
pr <- rowMeans(pred)

# visualize
ggplot() + 
  geom_contour(data=data.frame(x=xnew[,1], y=xnew[,2], pr=pr),
                aes(x=x, y=y, z=pr, colour=after_stat(level)) ) +
  scale_colour_gradient(low = "red", high = "green", guide='none') +
  geom_point(data=data.frame(x=x[,1],y=x[,2]), aes(x=x,y=y), color=y+2, size=1)

The fit seems reasonably good even with only a small number of inducing points.
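As a rough quantitative check, we can also compute the training accuracy implied by the predictive probabilities, again using gp_draw and a 0.5 threshold:

```r
# training accuracy with a 0.5 threshold on the predictive probabilities
prob <- rowMeans(gp_draw(gp, x, draws=500))
mean((prob > 0.5) == (y == 1))
```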

3.2 Random features

The package also implements the method of random features (Rahimi and Recht 2008). The accuracy of the approximation is determined by the number of random features (also called basis functions). Again, we use a relatively small number of features, which works fine here, but in more complicated problems many more might be needed for reasonable accuracy.

# fit the model
gp <- gp_init(cfs=cf_sexp(), lik=lik_bernoulli(), method = method_rf(num_basis=100))
gp <- gp_optim(gp,x,y)

Let’s again visualize the predictions.

# predict
ng <- 20
x1g <- seq(-4,4,len=ng)
x2g <- seq(-2,4,len=ng)

xnew <- cbind( rep(x1g,each=ng), rep(x2g,ng) )
pred <- gp_draw(gp, xnew, draws=500)
pr <- rowMeans(pred)

# visualize
ggplot() + 
  geom_contour(data=data.frame(x=xnew[,1], y=xnew[,2], pr=pr),
                aes(x=x, y=y, z=pr, colour=after_stat(level)) ) +
  scale_colour_gradient(low = "red", high = "green", guide='none') +
  geom_point(data=data.frame(x=x[,1],y=x[,2]), aes(x=x,y=y), color=y+2, size=1)

The fit seems more or less reasonable, albeit less accurate than the one obtained with FITC. As a rule of thumb, the random features approach usually requires more basis functions than FITC requires inducing points for comparable accuracy.

References

Kuss, Malte, and Carl Edward Rasmussen. 2005. “Assessing Approximate Inference for Binary Gaussian Process Classification.” Journal of Machine Learning Research 6: 1679–1704.

Quiñonero-Candela, Joaquin, and Carl Edward Rasmussen. 2005. “A Unifying View of Sparse Approximate Gaussian Process Regression.” Journal of Machine Learning Research 6: 1939–59.

Rahimi, Ali, and Benjamin Recht. 2008. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1177–84.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. The MIT Press. http://www.gaussianprocess.org/gpml/.

Snelson, Edward, and Zoubin Ghahramani. 2006. “Sparse Gaussian Processes Using Pseudo-Inputs.” In Advances in Neural Information Processing Systems 18, edited by Y. Weiss, B. Schölkopf, and J. C. Platt, 1257–64.